# WritingBench: A Comprehensive Benchmark for Generative Writing

## 📖 Overview
WritingBench is a comprehensive benchmark for evaluating LLMs' writing capabilities across **1,000 real-world queries**, spanning:
- 6 primary domains 
- 100 fine-grained subdomains
- 1,500+ avg. tokens per query

WritingBench integrates diverse source materials. Each query is paired with **5 instance-specific criteria**, and responses are scored against them either by an LLM evaluator or by a fine-tuned critic model.

## 🛠 Installation
```bash
git clone https://github.com/X-PLUG/WritingBench.git
```

## 🚀 Quick Start

1. Add your API credentials:
- For LLM-as-a-Judge, see `evaluator/llm.py`. We recommend using `Claude-3-7-Sonnet` for evaluation.
```python
  self.api_key = "your_api_key_here"
  self.url = "Your API endpoint"
  self.model = "Choose your model name"
```
- For the critic model, see `evaluator/critic.py`:
```python
  self.model = LLM(
      model="", # Your local path. Please download the critic model from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.
      tensor_parallel_size=1, # Your tensor parallel size setting. Defaults to 1, indicating no parallelism
  )
```
2. Choose an appropriate evaluation set from `benchmark_query/` and run the evaluation:
```bash
# --evaluator: critic or claude
# --query_criteria_file: use files under benchmark_query/
python evaluate_benchmark.py \
  --evaluator critic \
  --query_criteria_file query_set.jsonl \
  --input_file samples.jsonl \
  --output_file scores.jsonl
```

`samples.jsonl` stores the responses generated by the evaluated LLM, one JSON object per line. Here `i` stands for the index of the query that the response answers:
```json
{"index": i, "response": "xxx"}
```
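A file in this shape can be produced with a few lines of Python. The sketch below is not part of the repository: the helper names `write_samples` and `read_samples` are hypothetical, and it assumes each `index` is an integer that matches a query in the chosen query file.

```python
import json


def write_samples(responses, path):
    """Write one {"index": ..., "response": ...} object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for index, response in sorted(responses.items()):
            record = {"index": index, "response": response}
            # ensure_ascii=False keeps non-English responses human-readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_samples(path):
    """Read the file back into a dict mapping index -> response."""
    samples = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                samples[record["index"]] = record["response"]
    return samples
```

Reading the file back before running the evaluation is a cheap way to catch malformed lines early.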

## 🏗️ Benchmark Construction

WritingBench is built through a hybrid pipeline combining **Model-Augmented Query Generation** and **Human-in-the-Loop Refinement**, ensuring both diversity and real-world applicability. The construction process involves two key phases:

### 🤖 Model-Augmented Query Generation

#### Phase 1: Initial Query Generation
Leverage LLMs to generate queries from a two-tiered domain pool grounded in real-world writing scenarios, consisting of 6 primary domains and 100 secondary subdomains, covering:
   - 🔬 Academic & Engineering
   - 💼 Finance & Business
   - ⚖️ Politics & Law
   - 🎨 Literature & Art
   - 🎓 Education
   - 📢 Advertising & Marketing

#### Phase 2: Query Diversification
Enhance the diversity and practical applicability of queries by applying randomly selected strategies from the **Query Refinement Guidance Pool**, covering:
- Style Adjustments (e.g., kid-friendly tone)
- Format Specifications (e.g., IEEE template)
- Length Constraints (e.g., 500-word summary)
- Personalization (e.g., educator's perspective)
- Content Specificity (e.g., 2023 Q3 metrics)
- Expression Optimization (query rewriting)
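The random selection step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the strategy names come from the list above, while the instruction wording, the number of strategies drawn per query (`k`), and the helper names `pick_strategies` and `build_refinement_prompt` are hypothetical.

```python
import random

# Strategy names mirror the guidance pool; the instruction text is illustrative only.
GUIDANCE_POOL = {
    "Style Adjustments": "Adjust the style, e.g., use a kid-friendly tone.",
    "Format Specifications": "Require a specific format, e.g., an IEEE template.",
    "Length Constraints": "Add a length constraint, e.g., a 500-word summary.",
    "Personalization": "Write from a particular perspective, e.g., an educator's.",
    "Content Specificity": "Require specific content, e.g., 2023 Q3 metrics.",
    "Expression Optimization": "Rewrite the query so it is clearer and more natural.",
}


def pick_strategies(k=2, seed=None):
    """Randomly select k distinct strategy names from the pool."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(list(GUIDANCE_POOL), k)


def build_refinement_prompt(query, strategies):
    """Assemble a prompt asking an LLM to refine the query (hypothetical wording)."""
    guidance = "\n".join(f"- {name}: {GUIDANCE_POOL[name]}" for name in strategies)
    return (
        "Refine the writing query below by applying this guidance:\n"
        f"{guidance}\n\nQuery: {query}"
    )
```

Sampling without replacement ensures a query never receives the same strategy twice in one refinement pass.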

### ✍️ Human-in-the-Loop Refinement

#### Phase 1: Material Collection
30 trained annotators collect necessary open-source materials (e.g., public financial statements or legal templates), guided by material requirements generated by LLMs.

#### Phase 2: Expert Screening & Optimization
5 experts conduct a careful two-stage filtering process:
- query adaptation: ambiguous or unrealistic queries are revised to better align with the provided materials and practical scenarios
- material pruning: redundant or irrelevant content is eliminated from the collected materials

## 📈 Evaluation Framework

### Phase 1: Dynamic Criteria Generation
Given a query $q$ in WritingBench, the LLM is prompted to automatically generate a set of five evaluation criteria, $C_q = \{c_1, \ldots, c_5\}$. Each criterion comprises three components: a concise name summarizing the criterion, an extended description elaborating on the evaluation focus, and a detailed scoring rubric.
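One way to represent a generated criteria set in code is shown below. This is a minimal sketch: the field names `name`, `description`, and `rubric` and the helper `validate_criteria` are assumptions chosen to mirror the three components described above, not the schema used in the benchmark files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str         # concise name summarizing the criterion
    description: str  # extended description of the evaluation focus
    rubric: str       # detailed scoring rubric


def validate_criteria(criteria, expected=5):
    """Check that a query has exactly `expected` fully populated criteria."""
    if len(criteria) != expected:
        raise ValueError(f"expected {expected} criteria, got {len(criteria)}")
    for c in criteria:
        if not (c.name.strip() and c.description.strip() and c.rubric.strip()):
            raise ValueError(f"criterion {c.name!r} has an empty component")
    return list(criteria)
```

Validating the LLM's output this way catches incomplete criteria before they reach the scoring phase.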

### Phase 2: Rubric-based Scoring
For each criterion $c_i \in C_q$, the evaluator independently scores a response $r$ on a 10-point scale and provides a justification for the score.
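The per-criterion results can then be combined into a single number per response. The sketch below rests on assumptions the text above does not state: scores are taken to be integers from 1 to 10, the overall score is taken to be the plain mean over the criteria, and the record fields `score` and `justification` as well as the helper names are hypothetical.

```python
from statistics import mean


def check_score(score, low=1, high=10):
    """Validate one criterion score (assumed: an integer in [low, high])."""
    if isinstance(score, bool) or not isinstance(score, int):
        raise TypeError(f"score must be an integer, got {score!r}")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return score


def aggregate(evaluations):
    """Average the per-criterion scores of one response (assumed aggregation).

    `evaluations` is a list of dicts, one per criterion, each holding a
    "score" and a "justification".
    """
    if not evaluations:
        raise ValueError("no evaluations to aggregate")
    return mean(check_score(e["score"]) for e in evaluations)
```

Because each criterion is scored independently, a failure on one criterion can simply be retried without touching the others.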
